Abstract

Let $A$ be a finite set equipped with a probability distribution $P$, and let $M$ be a "mass" function on $A$. A characterization is given for the most efficient way in which $A^n$ can be covered using spheres of a fixed radius. A covering is a subset $C_n$ of $A^n$ with the property that most of the elements of $A^n$ are within some fixed distance from at least one element of $C_n$, and "most of the elements" means a set whose probability is exponentially close to one (with respect to the product distribution $P^n$). An efficient covering is one with small mass $M^n(C_n)$. With different choices for the geometry on $A$, this characterization gives various corollaries as special cases, including Marton's error-exponents theorem in lossy data compression, Hoeffding's optimal hypothesis testing exponents, and a new sharp converse to some measure concentration inequalities on discrete spaces.
1 Introduction

Let $A$ be a finite set and $P$ a probability distribution on $A$. Suppose that the distance (or "distortion") $\rho(x,y)$ between any two points $x,y \in A$ is measured by a given nonnegative function $\rho : A \times A \to [0,\infty)$, and for strings $x_1^n = (x_1, x_2, \ldots, x_n)$ and $y_1^n = (y_1, y_2, \ldots, y_n)$ in $A^n$ let $\rho_n(x_1^n, y_1^n)$ be the corresponding coordinate-wise distance (or single-letter distortion measure) on $A^n \times A^n$:

$$\rho_n(x_1^n, y_1^n) = \frac{1}{n} \sum_{i=1}^n \rho(x_i, y_i).$$

Since $A$ is a finite set, the function $\rho$ is bounded above by

$$D_{\max} = \max_{x,y \in A} \rho(x,y) = \max_{x_1^n, y_1^n \in A^n} \rho_n(x_1^n, y_1^n).$$

Without loss of generality we assume throughout that $P(a) > 0$ for all $a \in A$, and that for each $a \in A$ there exists a $b \in A$ with $\rho(a,b) = 0$ (otherwise we may consider $\rho'(x,y) = [\rho(x,y) - \min_{z \in A} \rho(x,z)]$ instead of $\rho(x,y)$).
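As a small illustration of these definitions, the following sketch computes the single-letter distortion $\rho_n$, the normalization $\rho'$, and $D_{\max}$ on a toy alphabet. The helper names (`rho_n`, `normalize`) and the three-letter alphabet with a shifted squared-error distortion are assumptions made only for this example.

```python
from itertools import product

def rho_n(x, y, rho):
    """Average per-letter distortion between two strings of equal length."""
    if len(x) != len(y):
        raise ValueError("strings must have the same length")
    return sum(rho(a, b) for a, b in zip(x, y)) / len(x)

def normalize(rho, alphabet):
    """Return rho'(x, y) = rho(x, y) - min_z rho(x, z), so that every
    letter x has at least one zero-distortion match."""
    offset = {x: min(rho(x, z) for z in alphabet) for x in alphabet}
    return lambda x, y: rho(x, y) - offset[x]

# Hypothetical example: squared error shifted by 1 on a 3-letter alphabet.
alphabet = (0, 1, 2)
raw = lambda x, y: (x - y) ** 2 + 1
rho = normalize(raw, alphabet)

# D_max is the largest single-letter distortion.
d_max = max(rho(x, y) for x, y in product(alphabet, repeat=2))
```

The maximum of `rho_n` over pairs of strings is attained by repeating a worst-case letter pair in every coordinate, which is why the two maxima in the definition of $D_{\max}$ agree.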

Given a $D > 0$, we want to cover "most" of $A^n$ using balls $B(y_1^n, D)$, where

$$B(y_1^n, D) = \{x_1^n \in A^n : \rho_n(x_1^n, y_1^n) \le D\}$$

is the closed ball of radius $D$ centered at $y_1^n \in A^n$. To be precise, given a set $C_n \subset A^n$, we write $[C_n]_D$ for the $D$-blowup of $C_n$,

$$[C_n]_D = \bigcup_{y_1^n \in C_n} B(y_1^n, D).$$

A $D$-covering of $A^n$ is a sequence of subsets $C_n$ of $A^n$, $n \ge 1$, such that the $P^n$-probability of the part of $A^n$ which is not covered by $C_n$ within distance $D$ is exponentially small,

$$\Pr\{\text{"error"}\} = 1 - P^n([C_n]_D) \approx 2^{-nE}, \tag{1}$$

for some $E > 0$. We are interested in "efficient" coverings of $A^n$, that is, given a "mass function" $M : A \to (0, \infty)$, we want to find $D$-coverings $\{C_n\}$ that satisfy (1) and also have small mass

$$M^n(C_n) = \sum_{y_1^n \in C_n} M^n(y_1^n) = \sum_{y_1^n \in C_n} \prod_{i=1}^n M(y_i).$$

Clearly there is a trade-off between finding coverings $\{C_n\}$ with small mass, and coverings with a good (i.e., large) error-exponent $E$ as in (1). Typically, the better the error-exponent, the larger the $C_n$, and the bigger their mass would tend to be.
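To make the objects in this trade-off concrete, here is a brute-force sketch that enumerates $A^n$ for a small $n$ and computes the $D$-blowup, the error probability $1 - P^n([C_n]_D)$ and the mass $M^n(C_n)$. The function names and the numerical values (binary alphabet, $n = 4$, a two-element $C_n$) are illustrative assumptions; exhaustive enumeration is only feasible for very small $n$.

```python
from itertools import product
from math import prod

def rho_n(x, y, rho):
    """Average per-letter distortion between two strings of equal length."""
    return sum(rho(a, b) for a, b in zip(x, y)) / len(x)

def blowup(C, alphabet, n, rho, D):
    """All strings of A^n within distance D of at least one element of C."""
    return {x for x in product(alphabet, repeat=n)
            if any(rho_n(x, y, rho) <= D + 1e-12 for y in C)}

def product_weight(x, w):
    """Product of per-letter weights: gives P^n(x) or M^n(x)."""
    return prod(w[a] for a in x)

def error_probability(C, alphabet, n, rho, D, P):
    """1 - P^n([C]_D): probability of the part of A^n left uncovered."""
    covered = blowup(C, alphabet, n, rho, D)
    return 1.0 - sum(product_weight(x, P) for x in covered)

def mass(C, M):
    """M^n(C): sum over the elements of C of their product mass."""
    return sum(product_weight(y, M) for y in C)

# Illustrative values (assumptions, not taken from the text).
alphabet = (0, 1)
P = {0: 0.6, 1: 0.4}
M = {0: 1.0, 1: 1.0}          # counting measure, so the mass is |C|
hamming = lambda a, b: float(a != b)
n, D = 4, 0.25
C = {(0, 0, 0, 0), (1, 1, 1, 1)}
err = error_probability(C, alphabet, n, rho=hamming, D=D, P=P)
```

In this toy case the uncovered strings are exactly those with two ones, so adding more centers to `C` lowers `err` at the price of a larger `mass(C, M)`.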

Motivated, in part, by the following example and by the applications illustrated in the examples 
of the following section, in our main result we give a precise characterization of this trade-off. 
Example: Measure Concentration on the Binary Cube.

Consider the $n$-dimensional binary cube $A^n = \{0,1\}^n$. We measure distance on $A^n$ by the proportion of mismatches between two binary strings $x_1^n$ and $y_1^n$, i.e., we take $\rho_n(x_1^n, y_1^n)$ to be the normalized Hamming distance,

$$\rho_n(x_1^n, y_1^n) = \frac{1}{n} \sum_{i=1}^n \mathbb{I}_{\{x_i \ne y_i\}}, \qquad x_1^n, y_1^n \in A^n, \tag{2}$$

which also coincides with the normalized graph distance when $A^n$ is equipped with the nearest-neighbor graph structure. For simplicity, in this example we consider natural logarithms and exponentials.

A well-known measure concentration inequality [?, Prop. 2.1.1], [?, Thm. 3.5] gives a precise upper bound on the sphere-covering error probability of an arbitrary $C_n$: For any $D > 0$, any product distribution $P^n$ on $A^n$, and any $C_n \subset A^n$,

$$\Pr\{\text{"error"}\} = 1 - P^n([C_n]_D) \le \frac{e^{-nD^2/2}}{P^n(C_n)}.$$

Therefore, if $\{C_n\}$ is any $D$-covering consisting of sets with $P^n(C_n) \approx e^{-nr}$ for some $r > 0$, then the union of the balls $B(y_1^n, D)$ centered at the points $y_1^n \in C_n$ covers all of $A^n$ except for a set of probability no greater than

$$e^{-n(D^2/2 - r)}. \tag{3}$$

It is then natural to ask, what is the best achievable error exponent among all $D$-coverings $\{C_n\}$ with probability no greater than $\approx e^{-nr}$? In other words, we are asking for small sets with the largest possible "boundary," sets $C_n$ with "volume" $P^n(C_n)$ no greater than $e^{-nr}$ but whose $D$-blowups $[C_n]_D$ cover as much of $A^n$ as possible. As pointed out in [?], this question can be thought of as the opposite of the usual isoperimetric problem.
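The concentration bound of this example can be checked by exhaustive enumeration on a small cube. The sketch below compares the exact error probability of a set with the quantity $e^{-nD^2/2}/P^n(C_n)$; the helper name `check_concentration`, the choice of $C_n$ as a Hamming ball around the all-zeros string, and the parameters $n = 10$, $D = 0.3$, $P(1) = 0.4$ are assumptions made for illustration. For such a small $n$ the bound can be loose or even vacuous (larger than one).

```python
from itertools import product
from math import exp

def check_concentration(n, p1, C, D):
    """Return (error probability, bound e^{-n D^2 / 2} / P^n(C)) for a set C
    of binary strings of length n under the Bernoulli(p1) product measure."""
    def prob(x):
        k = sum(x)                                   # number of ones in x
        return (p1 ** k) * ((1 - p1) ** (n - k))

    def dist(x, y):
        return sum(a != b for a, b in zip(x, y)) / n  # normalized Hamming

    p_C = sum(prob(y) for y in C)
    covered = sum(prob(x) for x in product((0, 1), repeat=n)
                  if any(dist(x, y) <= D + 1e-12 for y in C))
    return 1.0 - covered, exp(-n * D * D / 2) / p_C

n, D = 10, 0.3
# Hamming ball of (unnormalized) radius 3 around the all-zeros string.
C = [x for x in product((0, 1), repeat=n) if sum(x) <= 3]
err, bound = check_concentration(n, 0.4, C, D)
```

With $D = 0.3$ the exponent $D^2/2 - r$ is positive only for $r < 0.045$, which is the range in which the bound gives exponential decay.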

Taking M = P in the general setting described above, we obtain the answer to this question as 
a corollary to our general result in the following section; see Corollary 3. 



2 Results 

Given any $D \ge 0$ and any $R \in \mathbb{R}$, let $E(R, D)$ denote the best achievable error-exponent among all $D$-coverings with mass asymptotically bounded by $2^{nR}$. Letting $\mathcal{C}(R)$ denote the collection of all sequences of subsets $C_n$ of $A^n$ with $\limsup_{n} \frac{1}{n} \log M^n(C_n) \le R$, define,

$$E(R, D) = \sup_{\{C_n\} \in \mathcal{C}(R)} \liminf_{n \to \infty} -\frac{1}{n} \log \left[ 1 - P^n([C_n]_D) \right],$$

where 'log' denotes the logarithm taken to base 2.
A weaker version of this problem was recently considered in [?], where it was shown that the probability of error can only decrease to zero if $R$ is greater than $R(D; P, M)$,

$$R(D; P, M) = \inf_{(X,Y):\, X \sim P,\ E\rho(X,Y) \le D} \left[ H(P_{X,Y} \| P_X \times P_Y) + E[\log M(Y)] \right], \tag{4}$$

where the infimum is taken over all jointly distributed random variables $(X, Y)$ such that $X$ has distribution $P$ and $E\rho(X,Y) \le D$, and $P_{X,Y}$ denotes the joint distribution of $X, Y$, $P_Y$ denotes the marginal distribution of $Y$, and $H(\mu \| \nu)$ denotes the relative entropy between two probability measures $\mu$ and $\nu$ on the same finite set $S$,

$$H(\mu \| \nu) = \sum_{s \in S} \mu(s) \log \frac{\mu(s)}{\nu(s)}.$$

Therefore, the error-exponent $E(R, D)$ can only be nontrivial (i.e., nonzero) for $R > R(D; P, M)$. Also note that any $C_n \subset A^n$ has

$$\frac{1}{n} \log M^n(C_n) \le \frac{1}{n} \log M^n(A^n) = \log M(A).$$

Hence, from now on we restrict attention to the range of interesting values for $R$ between $R(D; P, M)$ and $R_{\max} = \log M(A)$.
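For a binary alphabet the rate-function $R(D; P, M)$ can be approximated directly from its definition as an infimum of mutual information plus expected log-mass. The following is a minimal grid-search sketch under stated assumptions: $A = \{0, 1\}$, Hamming distortion, and a source $\text{Bernoulli}(q)$ with $0 < q < 1$. The function name `rate_function_binary` and the grid sizes are hypothetical choices for this illustration and are not part of the original argument.

```python
from math import log2

def rate_function_binary(q, M, D, n_a=400, n_d=60):
    """Grid-search approximation of R(D; Q, M) in bits for A = {0, 1},
    Hamming distortion, Q = Bernoulli(q), and mass M = (M(0), M(1))."""
    best = float("inf")
    for j in range(n_d + 1):
        d = D * j / n_d                   # expected distortion, 0 <= d <= D
        for i in range(n_a + 1):
            a = i / n_a                   # a = Pr(Y = 1 | X = 0)
            b = (d - (1 - q) * a) / q     # b = Pr(Y = 0 | X = 1), from E rho = d
            if b < -1e-12 or b > 1 + 1e-12:
                continue
            b = min(max(b, 0.0), 1.0)
            joint = {(0, 0): (1 - q) * (1 - a), (0, 1): (1 - q) * a,
                     (1, 0): q * b, (1, 1): q * (1 - b)}
            px = (1 - q, q)
            py = (joint[0, 0] + joint[1, 0], joint[0, 1] + joint[1, 1])
            # H(P_XY || P_X x P_Y), i.e. the mutual information of (X, Y).
            mutual_info = sum(v * log2(v / (px[x] * py[y]))
                              for (x, y), v in joint.items() if v > 0)
            # E[log M(Y)] under the induced marginal of Y.
            mass_term = sum(py[y] * log2(M[y]) for y in (0, 1) if py[y] > 0)
            best = min(best, mutual_info + mass_term)
    return best
```

With $M \equiv 1$ the mass term vanishes and the search approximates the rate-distortion function of a Bernoulli source; with $D = 0$ only $Y = X$ is feasible, so the value reduces to the entropy of $Q$ plus $E[\log M(X)]$.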

Theorem. For all $D \in [0, D_{\max})$ and all $R(D; P, M) < R < R_{\max}$, the best achievable exponent of the error probability, among all $D$-coverings $\{C_n\}$ with mass asymptotically bounded by $2^{nR}$, is

$$E(R, D) = E^*(R, D) = \inf_{Q:\, R(D; Q, M) > R} H(Q \| P),$$

where $R(D; P, M)$ is defined in (4) and $H(Q \| P)$ denotes the relative entropy (or Kullback-Leibler divergence) between two distributions $P$ and $Q$.

Remarks. 

1. A slightly different error-exponent. Alternatively, we can define a version of the optimal error-exponent by considering only $D$-coverings $\{C_n\}$ with mass bounded by $2^{nR}$ for all $n$:

$$E'(R, D) = \liminf_{n \to \infty} -\frac{1}{n} \log \left\{ \min_{C_n:\, M^n(C_n) \le 2^{nR}} \left[ 1 - P^n([C_n]_D) \right] \right\}.$$

From the theorem it easily follows that $E'(R, D)$ is also equal to $E^*(R, D)$ at all points $R$ where $E^*(R, D)$ is continuous and, since it is nondecreasing in $R$, $E^*(R, D)$ is indeed continuous at all except countably many values of $R$. But in general it need not be continuous at every $R$, as illustrated in the discussions by Marton [?] and Ahlswede [?] for the special case of lossy data compression (which corresponds to taking $M(x) = 1$; see Example 2 below).
2. Proof. The proof of the theorem is a modification of Marton's original argument [7] for the
case of error-exponents in lossy data compression. The optimal sets {C n } achieving E*(R,D) are 
randomly generated, and they are universal in that their construction only depends on R, D, and 
M. Therefore, they achieve the optimal error-exponent simultaneously for all distributions P. 

Example 1: Hypothesis Testing. 

Let $P_0$ and $P_1$ be two probability distributions on $A$ with all positive probabilities. Suppose that the null hypothesis that a sample $X_1^n = (X_1, X_2, \ldots, X_n)$ of $n$ independent observations comes from $P_0$ is to be tested against the simple alternative that $X_1^n$ comes from $P_1$. Any test between these two hypotheses is simply a decision region $C_n \subset A^n$: If $X_1^n \in C_n$ we declare that $X_1^n \sim P_1^n$, otherwise we declare $X_1^n \sim P_0^n$. The set $C_n$ is called the critical region, and the type-I and type-II probabilities of error associated with the test are, respectively,

$$\alpha_n = P_0^n(C_n) \quad \text{and} \quad \beta_n = P_1^n(C_n^c).$$

Clearly we wish to have $\alpha_n$ and $\beta_n$ both decrease to zero as fast as possible. In particular, we may ask how quickly $\beta_n$ can decay to zero if we require that $\alpha_n$ decays exponentially at some rate $r > 0$, i.e., $\alpha_n \approx 2^{-nr}$. In statistical terminology, we are asking for the fastest rate of decay of the type-II error probability among all tests with significance level $\alpha_n \le 2^{-nr}$.

Formally, we want to identify the best exponent of the error probability $\beta_n = 1 - P_1^n(C_n)$ among all $C_n$ with $P_0^n(C_n) \le 2^{-nr}$. Taking $P = P_1$, $M = P_0$, $R = -r$, and allowing no distortion, this question reduces exactly to our earlier sphere-covering problem. [To be precise, allowing no distortion means we take $D = 0$ with $\rho(x,y)$ being Hamming distortion as in (2).] Accordingly, $R(D; P, M) = R(0; P_1, P_0)$ turns out to be equal to $-H(P_1 \| P_0)$, and from the theorem we immediately obtain the following classical result of Hoeffding. Also see [?, Thms. 9, 10] and [?, Ex. 12, p. 43] for versions of this result in the information theory literature.

Corollary 1. (Hypothesis Testing) [?] Let $\{C_n\}$ be an arbitrary sequence of tests with associated error probabilities $\alpha_n$ and $\beta_n$ as above. Among all tests with

$$\limsup_{n \to \infty} \frac{1}{n} \log \alpha_n \le -r$$

for some $r \in (0, H(P_1 \| P_0))$, the fastest achievable asymptotic rate of decay of $\beta_n$ is

$$\lim_{n \to \infty} -\frac{1}{n} \log \beta_n = \inf_{Q:\, H(Q \| P_0) < r} H(Q \| P_1).$$

As mentioned earlier, the optimal decision regions C n in the Corollary are randomly generated. 
Therefore, although they do achieve asymptotically optimal performance, they are not optimal for 
finite n in the Neyman-Pearson sense. 
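The exponent in Corollary 1 is straightforward to evaluate numerically. The sketch below relies on a standard fact that is not proved in this note: the minimizing $Q$ lies on the exponential family $Q_\lambda \propto P_0^{1-\lambda} P_1^{\lambda}$, $\lambda \in [0,1]$, along which $H(Q_\lambda \| P_0)$ increases from $0$ to $H(P_1 \| P_0)$. The function names and the bisection depth are illustrative assumptions.

```python
from math import log2

def kl_bits(q, p):
    """Relative entropy H(q || p) in bits; q and p are equal-length lists."""
    return sum(qi * log2(qi / pi) for qi, pi in zip(q, p) if qi > 0)

def tilted(p0, p1, lam):
    """The distribution Q_lam proportional to p0^(1 - lam) * p1^lam."""
    w = [a ** (1 - lam) * b ** lam for a, b in zip(p0, p1)]
    z = sum(w)
    return [wi / z for wi in w]

def hoeffding_exponent(p0, p1, r, iters=60):
    """Approximate inf { H(Q || p1) : H(Q || p0) <= r } for 0 < r < H(p1 || p0),
    by bisecting on lam until H(Q_lam || p0) is (just below) r."""
    if not 0 < r < kl_bits(p1, p0):
        raise ValueError("r must lie in (0, H(P1 || P0))")
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if kl_bits(tilted(p0, p1, mid), p0) < r:
            lo = mid
        else:
            hi = mid
    return kl_bits(tilted(p0, p1, lo), p1)
```

For example, `hoeffding_exponent([0.7, 0.3], [0.4, 0.6], 0.1)` evaluates the type-II exponent (in bits) when the type-I error is required to decay at rate $r = 0.1$; increasing $r$ lowers the returned exponent, which reflects the trade-off between the two error probabilities.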
Example 2: Lossy Data Compression.

Suppose data $X_1^n = (X_1, X_2, \ldots, X_n)$ is generated by a stationary, memoryless source, i.e., $X_1, X_2, \ldots, X_n$ are i.i.d. (independent and identically distributed) random variables, with distribution $P$ on the finite alphabet $A$. The objective of lossy data compression is to find efficient representations $y_1^n \in A^n$ for all source strings $x_1^n \in A^n$. In particular, suppose that the maximum amount of distortion $\rho_n(x_1^n, y_1^n)$ that we are willing to tolerate between any source string $x_1^n$ and its representation $y_1^n$ is some $D > 0$, where $\{\rho_n\}$ is a family of single-letter distortion measures as in the Introduction. Then the problem is to find an efficient codebook $C_n \subset A^n$ such that for most of the source strings $x_1^n$ there is a $y_1^n \in C_n$ with $\rho_n(x_1^n, y_1^n) \le D$.

Here, an efficient codebook $C_n$ is one that leads to good compression, i.e., one whose size is as small as possible. On the other hand, we also want to make sure that the probability that a source string cannot be represented by any element of $C_n$ with distortion $D$ or less is small. Taking $M$ to be counting measure ($M(x) = 1$ for all $x \in A$), the mass $M^n(C_n)$ of the codebook becomes its size $|C_n|$, and the problem of finding a good codebook reduces to the earlier sphere-covering question. Accordingly, the rate-function $R(D; P, M)$ reduces to Shannon's rate-distortion function $R(D; P)$, and the theorem yields Marton's error-exponents result.

Corollary 2. (Lossy Data Compression) [?] Let $D \ge 0$ be a given distortion level, and $R(D; P) < R < \log |A|$. Among all sequences of codebooks $\{C_n\}$ with asymptotic rate no greater than $R$ bits/symbol,

$$\limsup_{n \to \infty} \frac{1}{n} \log |C_n| \le R,$$

the fastest achievable asymptotic rate of decay of the probability of error is

$$\lim_{n \to \infty} -\frac{1}{n} \log \left[ 1 - P^n([C_n]_D) \right] = \inf_{Q:\, R(D; Q) > R} H(Q \| P).$$
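As an illustration of Corollary 2, the exponent can be approximated for a binary source with Hamming distortion, where the rate-distortion function has the well-known closed form $R(D; Q) = h(q) - h(D)$ for $D < \min(q, 1-q)$ and zero otherwise ($h$ being the binary entropy). The sketch assumes $D \le 1/2$; the function names and the grid resolution are hypothetical choices for this example.

```python
from math import log2

def h2(x):
    """Binary entropy in bits."""
    return 0.0 if x in (0.0, 1.0) else -x * log2(x) - (1 - x) * log2(1 - x)

def rate_distortion_bernoulli(q, D):
    """R(D; Q) in bits for Q = Bernoulli(q) under Hamming distortion."""
    m = min(q, 1 - q)
    return h2(m) - h2(D) if D < m else 0.0

def kl_bernoulli(q, p):
    """H(Bernoulli(q) || Bernoulli(p)) in bits, for 0 < p < 1."""
    total = 0.0
    for a, b in ((q, p), (1 - q, 1 - p)):
        if a > 0:
            total += a * log2(a / b)
    return total

def marton_exponent(p, D, R, grid=20000):
    """Grid approximation of inf { H(Q || P) : R(D; Q) > R } over Bernoulli Q.
    Returns infinity when no Q on the grid satisfies the constraint."""
    values = [kl_bernoulli(i / grid, p) for i in range(grid + 1)
              if rate_distortion_bernoulli(i / grid, D) > R]
    return min(values) if values else float("inf")
```

For rates $R$ below $R(D; P)$ the source distribution itself satisfies the constraint and the value is zero, while for $R$ above $\max_Q R(D; Q) = 1 - h(D)$ the constraint set is empty and the infimum is infinite, which matches the restriction on the range of $R$ in the corollary.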



Example 3: Measure Concentration on the Binary Cube. 

Consider again the setting of the example described in the introduction. There we asked for the best achievable error exponent among all $D$-coverings $\{C_n\}$ with probability no greater than $\approx e^{-nr}$. Taking $M = P$ in the theorem, we obtain the answer to this question in the following Corollary. Let $H_e(P \| Q)$ denote the relative entropy expressed in nats rather than bits, $H_e(P \| Q) = (\log_e 2) H(P \| Q)$, and similarly write $R_e(D; P, M) = (\log_e 2) R(D; P, M)$.

Corollary 3. (Converse Measure Concentration) Let $D \ge 0$ and $0 < r < -R_e(D; P, P)$. Among all $D$-coverings $\{C_n\}$ with

$$\limsup_{n \to \infty} \frac{1}{n} \log_e P^n(C_n) \le -r,$$

the fastest achievable asymptotic rate of decay of the probability of error is

$$\lim_{n \to \infty} -\frac{1}{n} \log_e \left[ 1 - P^n([C_n]_D) \right] = \mathcal{E}^*(r, D),$$

where

$$\mathcal{E}^*(r, D) = \inf_{Q:\, R_e(D; Q, P) > -r} H_e(Q \| P).$$



Although the exponent $\mathcal{E}^*(r, D)$ above is not as explicit as $(D^2/2 - r)$ in (3), it is easy to evaluate numerically and it contains much more useful information. For example, Figure 1 shows the graph of $\mathcal{E}^*(r, D)$ as a function of $r$, for $D = 0.3$, $P$ being the Bernoulli(0.4) distribution, and $r$ running over the range $r \in (0.6109, 0.6393)$ where $\mathcal{E}^*(r, D)$ is nontrivial (i.e., finite and nonzero). In this case, (3) is only useful when $(D^2/2 - r)$ is positive, i.e., for $r \in (0, 0.045)$: There (3) says that, whenever $P^n(C_n) \approx e^{-nr}$ for some $r \in (0, 0.045)$, the probability of error decays exponentially fast. But in that range, and in fact for all $r$ up to $\approx 0.61$, we have $\mathcal{E}^*(r, D) = \infty$ so there are sets $C_n$ with $P^n(C_n) \approx e^{-nr}$ and probability of error decaying super-exponentially fast. Moreover, in the range $r \in (0.6109, 0.6393)$ where $\mathcal{E}^*(r, D)$ is nontrivial, we can choose $C_n$ with $P^n(C_n) \approx e^{-nr}$ and $\Pr\{\text{"error"}\} \approx e^{-n \mathcal{E}^*(r, D)}$.



Figure 1: Graph of the error-exponent function $\mathcal{E}^*(r, D)$ in Corollary 3 as a function of $r$, for $D = 0.3$ and $P(1) = 0.4$. Note that $\mathcal{E}^*(r, D)$ is infinite for all $r \in (0, 0.6109)$, and that it is zero for $r > 0.6393$.



Finally we remark that the "extremal" sets in the classical isoperimetric problem, namely, those $C_n$ that achieve equality in (3), are very different from the extremal sets in Corollary 3. The former are well-known to be Hamming balls $B_n$ centered at $0^n = (0, 0, \ldots, 0) \in A^n$, $B_n = \{x_1^n : \rho_n(x_1^n, 0^n) \le \delta\}$ (see [?, p. 174], [?, Sec. 2.3]), while the latter are collections of strings $y_1^n$ randomly selected from a collection of suitable strings.

Extensions. 

1. Different alphabets. Although we assumed from the start that $\rho(x, y)$ is a distortion measure on $A \times A$, it is straightforward to generalize the main result as well as the subsequent discussion above to the case when $\rho(x, y)$ is a distortion measure between the "source" alphabet $A$ and a different ("reproduction") alphabet $\hat{A}$, as long as it is still the case that for each $a \in A$ there exists a $b \in \hat{A}$ with $\rho(a, b) = 0$. The necessary modifications to the statements and proofs follow exactly as in the case of Marton's result; see [?, Sec. 2.4].

2. Strong converse. As mentioned earlier, the theorem is stated only for values of $R$ above $R(D; P, M)$ since we trivially have $E(R, D) = 0$ for $R < R(D; P, M)$; see [?, Thm. 1]. In that range it is also possible to prove a "strong converse" showing that, not only $E(R, D) = 0$, but in fact the probability of error goes to one exponentially fast with a certain rate.



3 Proof 

First we prove the converse part of the theorem, asserting that $E(R, D) \le E^*(R, D)$.

Note that the rate-function $R(D; P, M)$ defined in (4) is jointly uniformly continuous in $D \ge 0$ and $P$; this can be easily seen to be the case by arguing along the lines of the proof of [?, Lemma 2.2.2] for the rate-distortion function $R(D; P)$. Now let $\{C_n\}$ be an arbitrary $D$-covering with $\{C_n\} \in \mathcal{C}(R)$. Take any $Q$ on $A$ such that $R(D; Q, M) > R$ (if no such $Q$ exists then the claim is trivially true), and let $\delta > 0$ be such that $R(D; Q, M) > R + \delta$. Since $\{C_n\} \in \mathcal{C}(R)$, we have $\log M^n(C_n) \le n(R + \delta/2)$, eventually, and by the continuity of $R(D; Q, M)$ in $D$ we can find an $\eta > 0$ small enough so that

$$\log M^n(C_n) \le n(R + \delta/2) < n R(D + \eta; Q, M), \quad \text{eventually.}$$

Therefore, by the "weak converse" in [?, Thm. 1], we must also have

$$E_{Q^n} \left[ \min_{y_1^n \in C_n} \rho_n(X_1^n, y_1^n) \right] > D + \eta, \quad \text{eventually,} \tag{5}$$

where $X_1^n$ denotes $n$ i.i.d. random variables with joint distribution $Q^n$. Writing

$$Z_n = \min_{y_1^n \in C_n} \rho_n(X_1^n, y_1^n),$$

and using the fact that $Z_n \le D_{\max}$, the bound in equation (5) implies that

$$D + \eta < E[Z_n] \le D \, Q^n(Z_n \le D) + D_{\max} \, Q^n(Z_n > D) = D + (D_{\max} - D) \, Q^n(Z_n > D),$$

i.e.,

$$Q^n(Z_n > D) > \frac{\eta}{D_{\max} - D}.$$



From Stein's lemma [?, Cor. 1.1.2] we also know that, for any $P$ and any $\epsilon > 0$,

$$\lim_{n \to \infty} \frac{1}{n} \log \left[ \min_{B_n \subset A^n:\, Q^n(B_n) \ge \epsilon} P^n(B_n) \right] = -H(Q \| P).$$

Taking $\epsilon = \eta / (D_{\max} - D) > 0$ and applying this to the events $B_n = \{Z_n > D\} = A^n \setminus [C_n]_D$ yields

$$\liminf_{n \to \infty} \frac{1}{n} \log \left[ 1 - P^n([C_n]_D) \right] \ge -H(Q \| P),$$

and since this holds for all $Q$ with $R(D; Q, M) > R$, we obtain

$$\limsup_{n \to \infty} -\frac{1}{n} \log \left[ 1 - P^n([C_n]_D) \right] \le E^*(R, D).$$

Finally, since $\{C_n\} \in \mathcal{C}(R)$ was arbitrary, this establishes that $E(R, D) \le E^*(R, D)$, as required.

To prove the direct part of the theorem, asserting the existence of a $D$-covering $\{C_n\} \in \mathcal{C}(R)$ such that

$$\liminf_{n \to \infty} -\frac{1}{n} \log \left[ 1 - P^n([C_n]_D) \right] \ge E^*(R, D),$$

we follow the same outline as in the proof of the direct part of [?, Thm. 2.4.5].

Using the joint uniform continuity of $R(D; P, M)$ in $D \ge 0$ and $P$, the proof of the type-covering lemma [?, Lemma 2.4.1] can be generalized to the corresponding statement with $R(D; P, M)$ in place of $R(D; P)$. The main new observation here is that, since all the elements $y_1^n$ of the covering set $B$ are drawn from the set $T^n_{[Y^*]}$ of $Y^*$-typical strings, where $(X^*, Y^*)$ achieve the infimum in the definition (4) of $R(D; P, M)$, their mass $M^n(y_1^n)$ satisfies

$$\frac{1}{n} \log M^n(y_1^n) \le E[\log M(Y^*)] + \delta_n \left[ \sum_{y \in A} |\log M(y)| \right],$$

where the sequence $\delta_n \to 0$ as $n \to \infty$.

Finally, following the same steps as in the proof of the direct part of [?, Thm. 2.4.5] and replacing $R(D; P)$ by $R(D; P, M)$, we obtain the existence of a $D$-covering $\{C_n\} \in \mathcal{C}(R)$ with error exponent no worse than $E^*(R, D) - \delta$, where $\delta > 0$ is an arbitrary constant. This proves that $E(R, D) \ge E^*(R, D)$, and completes the proof. □